Spike: Pause compaction when out of sync #10392
Conversation
It is not unreasonable, I think, but we should test it on a mainnet node.
I don't see anything massively wrong here; it would definitely be good to test on a mainnet node.
@@ -396,6 +396,9 @@ func (b *Blockstore) doCopy(from, to *badger.DB) error {
 	if workers < 2 {
 		workers = 2
 	}
+	if workers > 8 {
+		workers = 8
+	}
Do we have confidence in this number? Did we check that it's "good enough" on mainnet?
Need to check on this. There's some evidence that it might be too high right now, resulting in the chain getting out of sync. An alternate strategy is to leave it as is and see whether chain sync issues persist, correlated with moving GC.
@vyzo, for reference, do you have an estimate of a reasonable time for moving GC to complete today?
cc @TippyFlitsUK, you might have an even better idea.
Maybe 5-10 min.
Let's take a fresh measurement?
For reference, I got 50m on my fairly well-resourced mainnet node.
^ with the current default of half the CPUs.
I'll measure with 8 and compare when I get to similar garbage levels.
Working without major issues on mainnet (cc @TippyFlitsUK), and this seems to be preventing out-of-sync during moving GC.
One last thing to note: I'm suspicious that reducing goroutines here is the main thing causing the improvement, since moving GC is anecdotally what causes chain sync to get left behind, and we are not doing any yielding during the move (yet) in this change. If we are concerned about the risk of compaction deadlock, we could just include the simple goroutine limiting of moving GC.
// already out of sync, no signaling necessary
}
// TODO: ok to use hysteresis with no transitions between 30s and 1m?
When we're syncing, the head will only be taken when we get fully in sync, so this is probably fine.
(Well, really the new sync head is only selected when we complete the previous sync, but the effect is similar.)
Tests are mad
Conditions always call "unlock", so we can't safely use the condition with both the read and write sides of the lock. So we might as well revert back to a regular lock. Fixes #10616
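For background on the constraint described above, a minimal sketch (the pauser type and its methods are illustrative, not this PR's code): sync.Cond.Wait always calls Unlock and then Lock on the single Locker the Cond was built with, so one Cond cannot safely serve callers holding an RWMutex's read lock alongside callers holding its write lock. A plain Mutex keeps Wait's internal unlock paired with the lock the caller actually holds.

```go
package sketch

import "sync"

// pauser blocks compaction workers while the chain is out of sync.
// A single plain Mutex backs the Cond, so Wait's internal
// Unlock/Lock always operates on the lock the caller actually holds.
type pauser struct {
	mu     sync.Mutex
	cond   *sync.Cond
	paused bool
}

func newPauser() *pauser {
	p := &pauser{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// waitIfPaused blocks until paused is cleared.
func (p *pauser) waitIfPaused() {
	p.mu.Lock()
	for p.paused {
		p.cond.Wait() // atomically releases p.mu and sleeps; relocks on wake
	}
	p.mu.Unlock()
}

// setPaused updates the flag and wakes any waiting workers on resume.
func (p *pauser) setPaused(v bool) {
	p.mu.Lock()
	p.paused = v
	p.mu.Unlock()
	if !v {
		p.cond.Broadcast()
	}
}
```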
@arajasek I don't think you meant to close this, right?
…re-oos Backport #10392 into v1.21.0
Closed by #10641, backported and merged back to master.
Related Issues
I ended up playing with some of the chain sync contention protection ideas from #10388
Proposed Changes
Use head-change out-of-sync detection to signal compaction to pause. Signal it to continue when the chain is back in sync.
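As a rough sketch of that flow, with hypothetical names (signaller, watchHeads) standing in for the splitstore's actual wiring:

```go
package sketch

import "time"

// signaller watches head changes and tells the compaction loop to
// pause when the chain falls behind and to continue once it catches up.
type signaller struct {
	pause  chan struct{}
	resume chan struct{}
}

func (s *signaller) watchHeads(heads <-chan time.Time, maxLag time.Duration) {
	outOfSync := false
	for headTime := range heads {
		lag := time.Since(headTime)
		switch {
		case lag > maxLag && !outOfSync:
			outOfSync = true
			s.pause <- struct{}{} // signal compaction to pause
		case lag <= maxLag && outOfSync:
			outOfSync = false
			s.resume <- struct{}{} // signal compaction to continue
		}
		// otherwise already in the right state: no signaling necessary
	}
}
```

The blocking sends keep the sketch short; a real implementation would want buffered channels or non-blocking sends so a compaction loop that is mid-batch cannot stall the head-change handler.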
Additional Info
@vyzo, if you can take a look at this for feasibility, it would be a big help.
Checklist
Before you mark the PR ready for review, please make sure that:
- The PR title is in the form <PR type>: <area>: <change being made>
  - example: fix: mempool: Introduce a cache for valid signatures
  - PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
  - area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps